{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# JSON Processing with fenic\n",
    "\n",
    "This example demonstrates how to use fenic to process, analyze, and extract structured information from a complex JSON transcript file (such as those produced by speech-to-text systems like Whisper). \n",
    "\n",
    "The workflow covers the entire pipeline—from loading and casting the JSON data, to extracting word- and segment-level details, and aggregating speaker statistics using both JQ queries and DataFrame operations.\n",
    "\n",
    "**Key steps include**:\n",
    "- Loading a JSON transcript and casting it to a structured JSON type.\n",
    "- Using JQ queries to extract nested word- and segment-level data.\n",
    "- Structuring and cleaning extracted data with type casting and calculated fields.\n",
    "- Aggregating speaker statistics, such as total words, speaking time, and word rates.\n",
    "- Demonstrating hybrid processing: combining JSON extraction, array operations, and traditional DataFrame analytics.\n",
    "\n",
    "This notebook provides a practical example of how fenic can transform unstructured JSON data into structured, queryable DataFrames for further analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting Up the fenic Session\n",
    "\n",
    "This cell configures and initializes a Fenic session with semantic capabilities, enabling the use of a language model for advanced JSON document analysis and extraction tasks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "import fenic as fc\n",
    "\n",
    "config = fc.SessionConfig(\n",
    "    app_name=\"json_processing\",\n",
    "    semantic=fc.SemanticConfig(\n",
    "        language_models={\n",
    "            \"mini\": fc.OpenAIModelConfig(\n",
    "                model_name=\"gpt-4o-mini\",\n",
    "                rpm=500,\n",
    "                tpm=200_000\n",
    "            )\n",
    "        }\n",
    "    )\n",
    ")\n",
    "\n",
    "# Create session\n",
    "session = fc.Session.get_or_create(config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading and Preparing the JSON Transcript\n",
    "\n",
    "This cell loads the transcript from a JSON file, creates a Fenic DataFrame containing the raw JSON string, and casts it to a structured JSON type to enable further analysis and extraction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "transcript_path = Path(\"whisper-transcript.json\")\n",
    "\n",
    "with open(transcript_path, \"r\") as f:\n",
    "    json_content = f.read()\n",
    "\n",
    "# Create dataframe with the JSON string\n",
    "df = session.create_dataframe([{\"json_string\": json_content}])\n",
    "\n",
    "# Cast the JSON string to JSON type\n",
    "df_json = df.select(\n",
    "    fc.col(\"json_string\").cast(fc.JsonType).alias(\"json_data\")\n",
    ")\n",
    "\n",
    "df_json.show(1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting Word-Level Data Using JQ\n",
    "\n",
    "This cell uses a JQ query to extract all individual words from each segment in the JSON transcript. \n",
    "\n",
    "It demonstrates nested array traversal and variable binding, producing a DataFrame where each row contains both word-level and segment-level information.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract all words from all segments using JQ\n",
    "# This demonstrates nested array traversal and variable binding in JQ\n",
    "words_df = df_json.select(\n",
    "    fc.json.jq(\n",
    "        fc.col(\"json_data\"),\n",
    "        # JQ query explanation:\n",
    "        # - '.segments[] as $seg' iterates through segments, binding each to $seg\n",
    "        # - '$seg.words[]' iterates through words in each segment\n",
    "        # - Constructs object with both word-level and segment-level data\n",
    "        '.segments[] as $seg | $seg.words[] | {word: .word, speaker: .speaker, start: .start, end: .end, probability: .probability, segment_start: $seg.start, segment_end: $seg.end, segment_text: $seg.text}'\n",
    "    ).alias(\"word_data\")\n",
    ").explode(\"word_data\")  # Convert array of word objects into separate rows\n",
    "words_df.show(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Structuring and Cleaning Word-Level Data\n",
    "\n",
    "This cell defines a schema for word-level data, casts the extracted word objects to this schema, and unnests the fields for clarity. \n",
    "\n",
    "The result is a clean DataFrame with properly typed and named columns for each word and its associated metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract scalar values using struct casting and unnest - more efficient than JQ + get_item(0)\n",
    "# Define schema for word-level data structure\n",
    "word_schema = fc.StructType([\n",
    "    fc.StructField(\"word\", fc.StringType),\n",
    "    fc.StructField(\"speaker\", fc.StringType),\n",
    "    fc.StructField(\"start\", fc.FloatType),\n",
    "    fc.StructField(\"end\", fc.FloatType),\n",
    "    fc.StructField(\"probability\", fc.FloatType),\n",
    "    fc.StructField(\"segment_start\", fc.FloatType),\n",
    "    fc.StructField(\"segment_end\", fc.FloatType)\n",
    "])\n",
    "\n",
    "# Cast to struct and unnest to automatically extract all fields\n",
    "words_clean_df = words_df.select(\n",
    "    fc.col(\"word_data\").cast(word_schema).alias(\"word_struct\")\n",
    ").unnest(\"word_struct\").select(\n",
    "    # Rename fields for clarity\n",
    "    fc.col(\"word\").alias(\"word_text\"),\n",
    "    fc.col(\"speaker\"),\n",
    "    fc.col(\"start\").alias(\"start_time\"),\n",
    "    fc.col(\"end\").alias(\"end_time\"),\n",
    "    fc.col(\"probability\"),\n",
    "    fc.col(\"segment_start\"),\n",
    "    fc.col(\"segment_end\")\n",
    ")\n",
    "\n",
    "print(\"\\nScalar extracted fields:\")\n",
    "words_clean_df.show(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Adding Calculated Fields to Word Data\n",
    "\n",
    "This cell demonstrates how to add calculated fields to the word-level DataFrame. \n",
    "\n",
    "Specifically, it computes the duration of each word (end time minus start time), showcasing arithmetic operations on structured data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add calculated fields - types are already correct from struct schema\n",
    "# This demonstrates arithmetic operations on struct-extracted data\n",
    "words_final_df = words_clean_df.select(\n",
    "    \"*\",\n",
    "    # Calculate duration: end_time - start_time (demonstrates arithmetic on struct data)\n",
    "    (fc.col(\"end_time\") - fc.col(\"start_time\")).alias(\"duration\")\n",
    ")\n",
    "\n",
    "print(\"\\n📊 Words DataFrame with calculated duration:\")\n",
    "\n",
    "words_final_df.show(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating a Segments DataFrame\n",
    "\n",
    "This cell extracts segment-level data from the JSON transcript using a JQ query. \n",
    "\n",
    "Each row in the resulting DataFrame represents a segment, including its text, timing, and associated words, enabling analysis at a higher granularity than the word level."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 2. Create Segments DataFrame (Content-focused)\n",
    "print(\"\\n📝 Creating Segments DataFrame...\")\n",
    "\n",
    "# Extract segment-level data using JQ\n",
    "# This demonstrates extracting data at a different granularity level\n",
    "segments_df = df_json.select(\n",
    "    fc.json.jq(\n",
    "        fc.col(\"json_data\"),\n",
    "        # Extract segment objects with their text, timing, and nested words array\n",
    "        '.segments[] | {text: .text, start: .start, end: .end, words: .words}'\n",
    "    ).alias(\"segment_data\")\n",
    ").explode(\"segment_data\")  # Convert segments array into separate rows\n",
    "\n",
    "print(f\"Extracted {segments_df.count()} segments\")\n",
    "segments_df.show(3)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Structuring and Aggregating Segment-Level Data\n",
    "\n",
    "This cell defines a schema for basic segment fields and uses a hybrid approach—combining struct casting and JQ queries—to extract, aggregate, and calculate metrics for each segment. \n",
    "\n",
    "The resulting DataFrame includes segment text, timing, word count, average confidence, and duration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Extract segment fields using hybrid approach: struct casting + JQ for complex aggregations\n",
    "# Define schema for basic segment fields (text, start, end)\n",
    "segment_basic_schema = fc.StructType([\n",
    "    fc.StructField(\"text\", fc.StringType),\n",
    "    fc.StructField(\"start\", fc.FloatType),\n",
    "    fc.StructField(\"end\", fc.FloatType)\n",
    "])\n",
    "\n",
    "# First extract basic fields using struct casting, then add complex JQ aggregations\n",
    "segments_clean_df = segments_df.select(\n",
    "    # Extract basic segment data using struct casting (more efficient)\n",
    "    fc.col(\"segment_data\").cast(segment_basic_schema).alias(\"segment_struct\"),\n",
    "    # Complex array aggregations still use JQ (best tool for this)\n",
    "    fc.json.jq(fc.col(\"segment_data\"), '.words | length').get_item(0).cast(fc.IntegerType).alias(\"word_count\"),\n",
    "    fc.json.jq(fc.col(\"segment_data\"), '[.words[].probability] | add / length').get_item(0).cast(fc.FloatType).alias(\"average_confidence\")\n",
    ").unnest(\"segment_struct\").select(\n",
    "    # Rename for clarity\n",
    "    fc.col(\"text\").alias(\"segment_text\"),\n",
    "    fc.col(\"start\").alias(\"start_time\"),\n",
    "    fc.col(\"end\").alias(\"end_time\"),\n",
    "    fc.col(\"word_count\"),\n",
    "    fc.col(\"average_confidence\")\n",
    ").select(\n",
    "    \"segment_text\",\n",
    "    \"start_time\",\n",
    "    \"end_time\",\n",
    "    # Calculate segment duration using DataFrame arithmetic\n",
    "    (fc.col(\"end_time\") - fc.col(\"start_time\")).alias(\"duration\"),\n",
    "    \"word_count\",\n",
    "    \"average_confidence\"\n",
    ")\n",
    "\n",
    "print(\"\\n📊 Segments DataFrame with calculated metrics:\")\n",
    "segments_clean_df.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aggregating Speaker Statistics\n",
    "\n",
    "This cell creates a summary DataFrame that aggregates statistics for each speaker, such as total words spoken, speaking time, average confidence, and word rate. \n",
    "\n",
    "It demonstrates hybrid processing by combining JSON-extracted data with traditional DataFrame analytics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 3. Create Speaker Summary DataFrame (Aggregated)\n",
    "print(\"\\n🎤 Creating Speaker Summary DataFrame...\")\n",
    "\n",
    "# Use traditional DataFrame aggregations on JSON-extracted data\n",
    "# This demonstrates hybrid processing: JSON extraction + DataFrame analytics\n",
    "speaker_summary_df = words_final_df.group_by(\"speaker\").agg(\n",
    "    fc.count(\"*\").alias(\"total_words\"),                    # Count words per speaker\n",
    "    fc.avg(\"probability\").alias(\"average_confidence\"),     # Average speech confidence\n",
    "    fc.min(\"start_time\").alias(\"first_speaking_time\"),     # When speaker first appears\n",
    "    fc.max(\"end_time\").alias(\"last_speaking_time\"),        # When speaker last appears\n",
    "    fc.sum(\"duration\").alias(\"total_speaking_time\")        # Total time speaking\n",
    ").select(\n",
    "    \"speaker\",\n",
    "    \"total_words\", \n",
    "    \"total_speaking_time\",\n",
    "    \"average_confidence\",\n",
    "    \"first_speaking_time\",\n",
    "    \"last_speaking_time\",\n",
    "    # Calculate derived metric: words per minute\n",
    "    (fc.col(\"total_words\") / (fc.col(\"total_speaking_time\") / 60.0)).alias(\"word_rate\")\n",
    ")\n",
    "\n",
    "print(\"\\n📊 Speaker Summary DataFrame:\")\n",
    "speaker_summary_df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipeline Summary and Key Features\n",
    "\n",
    "This cell summarizes the entire JSON processing pipeline, highlighting the main inputs, outputs, and key features demonstrated in the notebook. \n",
    "\n",
    "It reviews the creation of structured DataFrames for words, segments, and speakers, and lists the core Fenic features used throughout the workflow.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Summary of what we accomplished\n",
    "print(\"\\n🎯 JSON Processing Pipeline Summary:\")\n",
    "print(\"=\" * 60)\n",
    "print(\"📁 Input: Single JSON file (whisper-transcript.json)\\n\")\n",
    "print(\"📊 Output: 3 structured DataFrames\")\n",
    "print()\n",
    "print(\"1. 🔤 Words DataFrame:\")\n",
    "print(f\"   - {words_final_df.count()} individual words extracted\")\n",
    "print(\"   - Fields: word_text, speaker, timing, probability, duration\")\n",
    "print(\"   - Demonstrates: JQ nested array extraction, type casting\")\n",
    "print()\n",
    "print(\"2. 📝 Segments DataFrame:\")\n",
    "print(f\"   - {segments_clean_df.count()} conversation segments\")\n",
    "print(\"   - Fields: text, timing, word_count, average_confidence\")\n",
    "print(\"   - Demonstrates: JQ aggregations, array operations\")\n",
    "print()\n",
    "print(\"3. 🎤 Speaker Summary DataFrame:\")\n",
    "print(f\"   - {speaker_summary_df.count()} speakers analyzed\")\n",
    "print(\"   - Fields: totals, averages, speaking patterns, word rates\")\n",
    "print(\"   - Demonstrates: DataFrame aggregations on JSON-extracted data\")\n",
    "print()\n",
    "print(\"🔧 Key Fenic JSON Features Used:\")\n",
    "print(\"   ✓ JSON type casting from strings\")\n",
    "print(\"   ✓ JQ queries for complex nested extraction\")\n",
    "print(\"   ✓ Array operations and aggregations\")\n",
    "print(\"   ✓ Type conversion and calculated fields\")\n",
    "print(\"   ✓ Traditional DataFrame operations on JSON data\")\n",
    "\n",
    "# Clean up\n",
    "session.stop()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
